McGill University
SIAM CT25 Tutorial
30 Jul, 2025
But for practical algorithms, we care about rates of convergence
Stochastic approximation with \(f(θ) = -0.5 θ\) and \(α_t = C/(t+1)\)
Stochastic approximation. Given a filtration \(\{\ALPHABET F_t\}_{t \ge 0}\), adapted sequences \(\{α_t\}_{t \ge 0}\) and \(\{\xi_t\}_{t \ge 1}\), consider the iteration:
\[ θ_{t+1} = θ_t + α_t \bigl[ f(θ_t) + ξ_{t+1} \bigr], \quad t \ge 0 \]
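The running example from these slides, \(f(θ) = -0.5\,θ\) with \(α_t = C/(t+1)\), can be simulated in a few lines; a minimal sketch (the Gaussian noise model and the constants `sigma`, `C`, `T` are illustrative choices, not prescribed by the theory):

```python
import random

def sa(f, theta0, T, C, sigma=0.1, seed=0):
    """Run the stochastic approximation iteration
    theta_{t+1} = theta_t + alpha_t * (f(theta_t) + xi_{t+1})
    with step sizes alpha_t = C/(t+1) and i.i.d. Gaussian noise
    (which satisfies (N1) and (N2))."""
    rng = random.Random(seed)
    theta = theta0
    for t in range(T):
        alpha = C / (t + 1)
        xi = rng.gauss(0.0, sigma)  # martingale difference noise
        theta = theta + alpha * (f(theta) + xi)
    return theta

# The example f(theta) = -0.5 * theta has theta* = 0.
theta_T = sa(lambda th: -0.5 * th, theta0=1.0, T=20000, C=4.0)
```

With \(V(θ) = θ^2\) this example has \(ρ = 1\), so \(C = 4\) puts it in the fast \(ρC > 2\) regime discussed later.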
F1. There is a unique \(θ^* \in \reals^d\) such that \(f(θ^*) = 0\).
F2. The function \(f\) is globally Lipschitz continuous with constant \(L\). \[ \NORM{f(θ_1) - f(θ_2)}_2 \le L \NORM{θ_1 - θ_2}_2, \quad \forall θ_1, θ_2 \in \reals^d. \]
\[ θ_{t+1} = θ_t + α_t \bigl[ f(θ_t) + ξ_{t+1} \bigr], \quad t \ge 0 \]
N1. Martingale difference noise. The noise is a martingale difference sequence, i.e., \[ \EXP_t[ ξ_{t+1} ] = 0, \hskip 0.5em a.s. \]
N2. Controlled growth of variance. There exists a \(σ^2\) such that \[ \EXP_t[ \NORM{ξ_{t+1}}^2_2 ] \le σ^2(1 + \NORM{θ_t - θ^*}_2^2), \hskip 0.5em a.s. \]
There exists a twice-differentiable Lyapunov function \(V \colon \reals^d \to \reals\) that satisfies
V1. Squared norm equivalent. There exist \(a,b > 0\) such that \[ a \NORM{θ - θ^*}_2^2 \le V(θ) \le b \NORM{θ - θ^*}_2^2, \quad \forall θ \in \reals^d. \]
V2. Smoothness. There exists \(M > 0\) such that \[ \NORM{\GRAD V(θ_1) - \GRAD V(θ_2) }_2 \le 2M \NORM{θ_1 - θ_2}_2, \quad \forall θ_1, θ_2 \in \reals^d. \]
V3. Exponential stability. There exists \(ρ > 0\) such that \[ \IP{\GRAD V(θ)}{f(θ)} < - ρ V(θ), \quad \forall θ \in \reals^d \setminus \{θ^*\}. \]
Implication 1 (F1), (F2), and (V1) imply \[ \NORM{f(θ)}_2^2 = \NORM{f(θ) - f(θ^*)}_2^2 \le L^2 \NORM{θ - θ^*}_2^2 \le \textcolor{red}{L^2} V(θ) \] and \[ \EXP_t[ \NORM{ξ_{t+1}}_2^2 ] \le σ^2(1 + \NORM{θ_t - θ^*}_2^2) \le \textcolor{red}{σ^2}(1 + V(θ_t)), \] where the red constants absorb the factor \(1/a\) from (V1).
Implication 2 (V2) implies \[ V(θ + η) \le V(θ) + \IP{\GRAD V(θ)}{η} + M \NORM{η}^2_2. \]
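Spelling out how the implications combine (a sketch, absorbing the smoothness constant \(M\) into the constants, in the same spirit as the red constants above): apply Implication 2 with \(η = α_t \bigl[ f(θ_t) + ξ_{t+1} \bigr]\), take \(\EXP_t\), use (N1) to drop the cross term, (V3) for the drift term, and Implication 1 for the squared terms: \[ \EXP_t[ V(θ_{t+1}) ] \le V(θ_t) + α_t \IP{\GRAD V(θ_t)}{f(θ_t)} + M α_t^2 \bigl( \NORM{f(θ_t)}_2^2 + \EXP_t[ \NORM{ξ_{t+1}}_2^2 ] \bigr) \le \bigl( 1 - ρ α_t + α_t^2(L^2 + σ^2) \bigr) V(θ_t) + σ^2 α_t^2. \]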
Lyapunov drift plus diffusion
Let \(\{a_t\}_{t \ge 0}\), \(\{b_t\}_{t \ge 0}\), \(\{u_t\}_{t \ge 0}\) be real-valued sequences with \(1 + a_t > 0\) that satisfy \[ u_{t+1} \le (1 + a_t)u_t + b_t. \] Define \(Φ_{t,T} = \prod_{s=t}^{T-1}(1 + a_s)\). Then,
\[ u_T \le Φ_{0,T} u_0 + \sum_{t=0}^{T-1} Φ_{t+1,T}b_t. \]
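The comparison lemma can be sanity-checked numerically on arbitrary sequences satisfying the hypothesis; a small sketch (the random test sequences are illustrative), where the recursion is built with equality so the bound should be tight:

```python
import random

rng = random.Random(1)
T = 50
a = [rng.uniform(-0.5, 0.5) for _ in range(T)]  # ensures 1 + a_t > 0
b = [rng.uniform(0.0, 1.0) for _ in range(T)]

# Build a sequence u satisfying u_{t+1} = (1 + a_t) u_t + b_t (with equality).
u = [1.0]
for t in range(T):
    u.append((1 + a[t]) * u[-1] + b[t])

def Phi(t, T_):
    """State-transition product Phi_{t,T} = prod_{s=t}^{T-1} (1 + a_s)."""
    p = 1.0
    for s in range(t, T_):
        p *= 1 + a[s]
    return p

# Zero input response + zero state response.
bound = Phi(0, T) * u[0] + sum(Phi(t + 1, T) * b[t] for t in range(T))
```

Since the recursion holds with equality, `u[T]` and `bound` coincide up to floating-point error.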
Let \(Φ_{t,T} = \prod_{s=t}^{T-1} \bigl(1 - ρ α_s + α_s^2(L^2 + σ^2)\bigr)\). Then \[ \EXP[V(θ_T)] \le Φ_{0,T} \EXP[ V(θ_0) ] + σ^2 \sum_{t=0}^{T-1} Φ_{t+1,T} α_t^2 \]
If \(\{α_t\}_{t \ge 0}\) is a decreasing sequence, then at some time \(T_0\), \(α_t\) becomes smaller than \(ρ/\bigl(2(L^2 + σ^2)\bigr)\). Then, \[ 1 - ρ α_t + α_t^2 (L^2 + σ^2) < 1 - \tfrac 12 ρ α_t, \quad \forall t \ge T_0. \]
Therefore, for \(t \ge T_0\), \[ Φ_{t+1,T} \le \prod_{s=t+1}^{T-1}(1 - \tfrac 12 ρ α_s) \le \exp\left(-\frac ρ2 \sum_{s=t+1}^{T-1} α_s \right). \]
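The last step is the elementary inequality \(1 - x \le e^{-x}\) applied term by term; a quick numerical check with illustrative step sizes \(α_s = \tfrac{1}{2}/(s+1)\) and \(ρ = 1\):

```python
import math

rho = 1.0
alpha = [0.5 / (s + 1) for s in range(1000)]  # decreasing step sizes

# Product form of Phi_{0,T} after the (1 - rho*alpha_s/2) simplification.
prod = 1.0
for a in alpha:
    prod *= 1 - 0.5 * rho * a

# Exponential upper bound exp(-rho/2 * sum alpha_s).
bound = math.exp(-0.5 * rho * sum(alpha))
```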
So, for the zero input response, we simply need the rate of growth of \(\sum α_s\).
The zero state response is trickier (more on that later).
\[ \sum_{t=0}^{T-1} \dfrac{1}{(t+1)^q} \le φ_q(T) \coloneqq \begin{cases} \dfrac{T^{1-q}}{1-q}, & q \in (0,1) \\ 1 + \log T, & q = 1 \\ < ∞, & q > 1 \end{cases} \]
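These growth bounds are easy to verify numerically; a quick check (for \(q = 1\) the harmonic sum is compared against \(1 + \log T\), since \(H_T \le 1 + \log T\)):

```python
import math

def phi_sum(q, T):
    """Left-hand side: sum_{t=0}^{T-1} 1/(t+1)^q."""
    return sum(1.0 / (t + 1) ** q for t in range(T))

T = 10000
lhs_half = phi_sum(0.5, T)  # q in (0,1): bounded by T^{1-q}/(1-q)
lhs_one = phi_sum(1.0, T)   # q = 1: bounded by 1 + log T
lhs_two = phi_sum(2.0, T)   # q > 1: converges (to zeta(2) < 2 here)
```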
For \(α_t = C/(t+1)^q\) with \(q \in (0,1)\): from the rate of growth of \(\sum α_s\) we have \[ Φ_{0,T} \le \exp\left( - \frac{ρC}{2(1-q)} T^{1-q} \right) \] which decays sub-exponentially (faster than any polynomial).
Moreover, assume that \(T_0\) is large enough so that \(\dfrac{1}{t+1} \le \dfrac{ρC}{2(t+1)^q}\) for all \(t \ge T_0\). Then, for any \(t + 1 \ge T_0\), \[ Φ_{t+1, T} \le \prod_{s=t+1}^{T-1} \left( 1 - \frac{1}{s+1} \right) = \frac{t+1}{T}. \]
The zero input term decays as \[ Φ_{0,T} \EXP[ V(θ_0) ] \le \EXP[ V(θ_0) ]\exp\left( - \frac{ρC}{2(1-q)} T^{1-q} \right) \]
The tail of the zero state term decays as (for \(q \in (\tfrac12, 1)\)) \[ σ^2 \sum_{t=\textcolor{red}{T_0}}^{T-1} Φ_{t+1,T} α_t^2 \le \frac{σ^2 C^2}{T} \sum_{t=T_0}^{T-1} \frac{1}{(t+1)^{2q-1}} \le \frac{σ^2 C^2}{2(1-q)} \frac{1}{T^{2q-1}} \]
For \(α_t = C/(t+1)\) (i.e., \(q = 1\)): from the rate of growth of \(\sum α_s\), we have \[ Φ_{t+1,T} \le \exp\left(- \frac{ρC}{2} \log \left(\frac{T}{t+1}\right)\right) = \left(\frac{t+1}{T}\right)^{ρC/2}. \]
The zero input term decays as \[ Φ_{0,T} \EXP[ V(θ_0) ] \le \EXP[ V(θ_0) ] \left(\frac{1}{T}\right)^{ρC/2} \]
The tail of the zero state term decays as \[ σ^2 \sum_{t=\textcolor{red}{T_0}}^{T-1} Φ_{t+1,T} α_t^2 \le \frac{σ^2 C^2}{T^{ρC/2}} \sum_{t=T_0}^{T-1} \frac{1}{(t+1)^{2 - \frac{ρC}2}} \le \frac{σ^2 C^2}{T^{ρC/2}} φ_{2 - ρC/2}(T) \]
Overall Rate depends on \(ρC\)
\[ \EXP[ V(θ_T) ] \le \begin{dcases} \mathcal{O}\left( \frac 1T \right), & ρC > 2 \\[5pt] \mathcal{O}\left( \frac {\log(T)}T \right), & ρC = 2 \\[5pt] \mathcal{O}\left( \frac {1}{T^{ρC/2}} \right), & ρC < 2 \end{dcases} \]
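The \(ρC > 2\) regime can be checked numerically by iterating the one-step drift recursion, whose solution is exactly the right-hand side of the comparison-lemma bound; a sketch matching the running example (\(ρ = 1\), \(C = 4\), \(L^2 + σ^2 = 1.25\), \(σ^2 = 1\) are the illustrative constants, so \(ρC = 4 > 2\) and we expect an \(\mathcal{O}(1/T)\) rate, i.e., the bound halves when \(T\) doubles):

```python
def bound_value(T, rho=1.0, C=4.0, k=1.25, sigma2=1.0, v0=1.0):
    """Iterate v_{t+1} = (1 - rho*alpha_t + alpha_t^2 * k) v_t + sigma2 * alpha_t^2
    with alpha_t = C/(t+1) and k = L^2 + sigma^2.  The result equals
    Phi_{0,T} v_0 + sigma2 * sum_t Phi_{t+1,T} alpha_t^2."""
    v = v0
    for t in range(T):
        alpha = C / (t + 1)
        v = (1 - rho * alpha + alpha * alpha * k) * v + sigma2 * alpha * alpha
    return v

# rho*C = 4 > 2: doubling the horizon should roughly halve the bound.
ratio = bound_value(200000) / bound_value(100000)
```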
Stochastic approximation with \(f(θ) = -0.5 θ\) and \(α_t = C/(t+1)\)
Here the iteration seeks a fixed point of an operator \(H\), with \(f(θ) = H(θ) - θ\). In some settings, \(H\) is a weighted \(\ell_2\)-norm contraction.
In general, \(H\) is an \(\ell_{\infty}\)-norm contraction.
The Lyapunov function does not satisfy (V2) (smoothness).
Can consider the Moreau envelope \[ M_{λ}(θ) = \min_{φ \in \reals^d} \left\{ \NORM{φ}_{∞} + \frac{1}{2 λ} \NORM{θ - φ}_2^2 \right\} \] which is close to \(\NORM{⋅}_{∞}\) and is \(1/λ\)-smooth (Chen et al. 2020).
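The envelope can be probed numerically; a crude two-dimensional sketch (brute-force grid search is illustrative only, not how one would compute the envelope in practice). Since \(\NORM{⋅}_{∞}\) is 1-Lipschitz with respect to \(\NORM{⋅}_2\), the envelope satisfies \(\NORM{θ}_{∞} - λ/2 \le M_λ(θ) \le \NORM{θ}_{∞}\), which the sketch checks:

```python
def moreau_linf(theta, lam, half_width=2.0, n=201):
    """Approximate M_lam(theta) = min_phi ||phi||_inf + ||theta - phi||_2^2 / (2*lam)
    for theta in R^2 by grid search over phi in a box around theta."""
    x0, y0 = theta
    best = float("inf")
    for i in range(n):
        for j in range(n):
            px = x0 - half_width + 2 * half_width * i / (n - 1)
            py = y0 - half_width + 2 * half_width * j / (n - 1)
            val = max(abs(px), abs(py)) + ((x0 - px) ** 2 + (y0 - py) ** 2) / (2 * lam)
            if val < best:
                best = val
    return best

theta, lam = (1.0, -0.4), 0.5
m = moreau_linf(theta, lam)  # ||theta||_inf = 1.0, so m should lie in [0.75, 1.0]
```

For this particular \(θ\) and \(λ\), the minimizer is \(φ = (0.5, -0.4)\) and the lower bound \(\NORM{θ}_{∞} - λ/2 = 0.75\) is attained.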
V3’. There exists a \(ρ > 0\) and \(c > 0\) such that for all \(θ \in \reals^d\), \[ \IP{ \GRAD V(θ) }{ f(θ) } \le c - ρV(θ) \]
Taking \(α_t = C/\sqrt{t+1}\) gives
\[ \EXP[Δ_{τ}] \le \mathcal{O}\left( \frac{\log T}{\sqrt{T}} \right). \]
A poorer rate, but it does not require the exponential stability assumption.
Simple derivation of non-asymptotic rate of convergence for stochastic approximation
Key ideas: Lyapunov drift plus diffusion; the comparison lemma, which splits the bound into a zero input and a zero state response; and the rate of growth of \(\sum α_s\).
Quite a rich area. There is a lot of interest in relaxing the regularity assumptions on \(f\) and \(V\), depending on the specific application.